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Abstract 

This  paper  reports  some  experiments  that  com¬ 
pare  the  accuracy  and  performance  of  two 
stochastic  parsing  systems.  The  currently  pop¬ 
ular  Collins  parser  is  a  shallow  parser  whose 
output  contains  more  detailed  semantically- 
relevant  information  than  other  such  parsers. 

The  XLE  parser  is  a  deep-parsing  system  that 
couples  a  Lexical  Functional  Grammar  to  a  log- 
linear  disambiguation  component  and  provides 
much  richer  representations  theory.  We  mea¬ 
sured  the  accuracy  of  both  systems  against  a 
gold  standard  of  the  PARC  700  dependency 
bank,  and  also  measured  their  processing  times. 

We  found  the  deep-parsing  system  to  be  more 
accurate  than  the  Collins  parser  with  only  a 
slight  reduction  in  parsing  speed.' 

1  Introduction 

In  applications  that  are  sensitive  to  the  meanings  ex¬ 
pressed  by  natural  language  sentences,  it  has  become 
common  in  recent  years  simply  to  incorporate  publicly 
available  statistical  parsers.  A  state-of-the-art  statistical 
parsing  system  that  enjoys  great  popularity  in  research 
systems  is  the  parser  described  in  Collins  (1999)  (hence¬ 
forth  “the  Collins  parser”).  This  system  not  only  is  fre¬ 
quently  used  for  off-line  data  preprocessing,  but  also 
is  included  as  a  black-box  component  for  applications 
such  as  document  summarization  (Daume  and  Marcu, 
2002),  information  extraction  (Miller  et  al.,  2000),  ma¬ 
chine  translation  (Yamada  and  Knight,  2001),  and  ques¬ 
tion  answering  (Harabagiu  et  al.,  2001).  This  is  be- 
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cause  the  Collins  parser  shares  the  property  of  robustness 
with  other  statistical  parsers,  but  more  than  other  such 
parsers,  the  categories  of  its  parse-trees  make  grammati¬ 
cal  distinctions  that  presumably  are  useful  for  meaning- 
sensitive  applications.  For  example,  the  categories  of 
the  Model  3  Collins  parser  distinguish  between  heads, 
arguments,  and  adjuncts  and  they  mark  some  long¬ 
distance  dependency  paths;  these  distinctions  can  guide 
application-specific  postprocessors  in  extracting  impor¬ 
tant  semantic  relations. 

In  contrast,  state-of-the-art  parsing  systems  based  on 
deep  grammars  mark  explicitly  and  in  much  more  de¬ 
tail  a  wider  variety  of  syntactic  and  semantic  dependen¬ 
cies  and  should  therefore  provide  even  better  support  for 
meaning-sensitive  applications.  But  common  wisdom  has 
it  that  parsing  systems  based  on  deep  linguistic  grammars 
are  too  difficult  to  produce,  lack  coverage  and  robustness, 
and  also  have  poor  run-time  performance.  The  Collins 
parser  is  thought  to  be  accurate  and  fast  and  thus  to  repre¬ 
sent  a  reasonable  trade-off  between  “good-enough”  out¬ 
put,  speed,  and  robustness. 

This  paper  reports  on  some  experiments  that  put  this 
conventional  wisdom  to  an  empirical  test.  We  investi¬ 
gated  the  accuracy  of  recovering  semantically-relevant 
grammatical  dependencies  from  the  tree-structures  pro¬ 
duced  by  the  Collins  parser,  comparing  these  dependen¬ 
cies  to  gold-standard  dependencies  which  are  available 
for  a  subset  of  700  sentences  randomly  drawn  from  sec¬ 
tion  23  of  the  Wall  Street  Journal  (see  King  et  al.  (2003)). 

We  compared  the  output  of  the  XLE  system,  a 
deep-grammar-based  parsing  system  using  the  English 
Lexical-Functional  Grammar  previously  constructed  as 
part  of  the  Pargram  project  (Butt  et  al.,  2002),  to  the 
same  gold  standard.  This  system  incorporates  sophisti¬ 
cated  ambiguity-management  technology  so  that  all  pos¬ 
sible  syntactic  analyses  of  a  sentence  are  computed  in 
an  efficient,  packed  representation  (Maxwell  and  Ka¬ 
plan,  1993).  In  accordance  with  LEG  theory,  the  output 
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includes  not  only  standard  context-free  phrase-structure 
trees  but  also  attribute-value  matrices  (LFG’s  f(unctional) 
structures)  that  explicitly  encode  predicate-argument  re¬ 
lations  and  other  meaningful  properties.  XLE  selects  the 
most  probable  analysis  from  the  potentially  large  candi¬ 
date  set  by  means  of  a  stochastic  disambiguation  com¬ 
ponent  based  on  a  log-linear  (a.k.a.  maximum-entropy) 
probability  model  (Riezler  et  al.,  2002).  The  stochas¬ 
tic  component  is  also  “ambiguity-enabled”  in  the  sense 
that  the  computations  for  statistical  estimation  and  selec¬ 
tion  of  the  most  probable  analyses  are  done  efficiently 
by  dynamic  programming,  avoiding  the  need  to  unpack 
the  parse  forests  and  enumerate  individual  analyses.  The 
underlying  parsing  system  also  has  built-in  robustness 
mechanisms  that  allow  it  to  parse  strings  that  are  outside 
the  scope  of  the  grammar  as  a  shortest  sequence  of  well- 
formed  “fragments”.  Furthermore,  performance  parame¬ 
ters  that  bound  parsing  and  disambiguation  work  can  be 
tuned  for  efficient  but  accurate  operation. 

As  part  of  our  assessment,  we  also  measured  the  pars¬ 
ing  speed  of  the  two  systems,  taking  into  account  all 
stages  of  processing  that  each  system  requires  to  produce 
its  output.  For  example,  since  the  Collins  parser  depends 
on  a  prior  part-of-speech  tagger  (Ratnaparkhi,  1996),  we 
included  the  time  for  POS  tagging  in  our  Collins  mea¬ 
surements.  XFF  incorporates  a  sophisticated  finite-state 
morphology  and  dictionary  lookup  component,  and  its 
time  is  part  of  the  measure  of  XFE  performance. 

Performance  parameters  of  both  the  Collins  parser  and 
the  XFE  system  were  adjusted  on  a  heldout  set  consist¬ 
ing  of  a  random  selection  of  1/5  of  the  PARC  700  depen¬ 
dency  bank;  experimental  results  were  then  based  on  the 
other  560  sentences.  For  Model  3  of  the  Collins  parser,  a 
beam  size  of  1000,  and  not  the  recommended  beam  size 
of  10000,  was  found  to  optimize  parsing  speed  at  little 
loss  in  accuracy.  On  the  same  heldout  set,  parameters  of 
the  stochastic  disambiguation  system  and  parameters  for 
parsing  performance  were  adjusted  for  a  Core  and  a  Com¬ 
plete  version  of  the  XFE  system,  differing  in  the  size  of 
the  constraint-set  of  the  underlying  grammar. 

For  both  XFE  and  the  Collins  parser  we  wrote  con¬ 
version  programs  to  transform  the  normal  (tree  or  f- 
structure)  output  into  the  corresponding  relations  of 
the  dependency  bank.  This  conversion  was  relatively 
straightforward  for  FFG  structures  (King  et  al.,  2003). 
However,  a  certain  amount  of  skill  and  intuition  was 
required  to  provide  a  fair  conversion  of  the  Collins 
trees:  we  did  not  want  to  penalize  configurations  in  the 
Collins  trees  that  encoded  alternative  but  equally  legit¬ 
imate  representations  of  the  same  linguistic  properties 
(e.g.  whether  auxiliaries  are  encoded  as  main  verbs  or 
aspect  features),  but  we  also  did  not  want  to  build  into 
the  conversion  program  transformations  that  compensate 
for  information  that  Collins  cannot  provide  without  ap¬ 


pealing  to  additional  linguistic  resources  (such  as  identi¬ 
fying  the  subjects  of  infinitival  complements).  We  did  not 
include  the  time  for  dependency  conversion  in  our  mea¬ 
sures  of  performance. 

The  experimental  results  show  that  stochastic  parsing 
with  the  Core  FFG  grammar  achieves  a  better  F-score 
than  the  Collins  parser  at  a  roughly  comparable  parsing 
speed.  The  XFE  system  achieves  12%  reduction  in  error 
rate  over  the  Collins  parser,  that  is  77.6%  F-score  for  the 
XFE  system  versus  74.6%  for  the  Collins  parser,  at  a  cost 
in  parsing  time  of  a  factor  of  1.49. 

2  Stochastic  Parsing  with  LFG 

2.1  Parsing  with  Lexical-Functional  Grammar 

The  grammar  used  for  this  experiment  was  developed  in 
the  ParGram  project  (Butt  et  al.,  2002).  It  uses  LFG  as  a 
formalism,  producing  c(onstituent)-structures  (trees)  and 
f(unctional)-structures  (attribute  value  matrices)  as  out¬ 
put.  The  c-structures  encode  constituency  and  linear  or¬ 
der.  F-structures  encode  predicate-argument  relations  and 
other  grammatical  information,  e.g.,  number,  tense,  state¬ 
ment  type.  The  XLE  parser  was  used  to  produce  packed 
representations,  specifying  all  possible  grammar  analyses 
of  the  input. 

In  our  system,  tokenization  and  morphological  analy¬ 
sis  are  performed  by  finite-state  transductions  arranged  in 
a  compositional  cascade.  Both  the  tokenizer  and  the  mor¬ 
phological  analyzer  can  produce  multiple  outputs.  For  ex¬ 
ample,  the  tokenizer  will  optionaly  lowercase  sentence 
initial  words,  and  the  morphological  analyzer  will  pro¬ 
duce  walk  +Verb  +Pres  -i-5sg  and  walk  +Noun  +Pl  for 
the  input  form  walks.  The  resulting  tokenized  and  mor¬ 
phologically  analyzed  strings  are  presented  to  the  sym¬ 
bolic  LFG  grammar. 

The  grammar  can  parse  input  that  has  XML  de¬ 
limited  named  entity  markup:  <company>  Columbia 
Savings  </company>  is  a  major  holder  of  so-called  junk 
bonds.  To  allow  the  grammar  to  parse  this  markup, 
the  tokenizer  includes  an  additional  tokenization  of  the 
strings  whereby  the  material  between  the  XML  markup 
is  treated  as  a  single  token  with  a  special  morphologi¬ 
cal  tag  (H-NamedEntity).  As  a  fall  back,  the  tokenization 
that  the  string  would  have  received  without  that  markup 
is  also  produced.  The  named  entities  have  a  single  mul¬ 
tiword  predicate.  This  helps  in  parsing  both  because  it 
means  that  no  internal  structure  has  to  be  built  for  the 
predicate  and  because  predicates  that  would  otherwise  be 
unrecognized  by  the  grammar  can  be  parsed  (e.g.,  Cie. 
Financiere  de  Paribas).  As  described  in  section  5,  it  was 
also  important  to  use  named  entity  markup  in  these  ex¬ 
periments  to  more  fairly  match  the  analyses  in  the  PARC 
700  dependency  bank. 

To  increase  robustness,  the  standard  grammar  is  aug- 


merited  with  a  FRAGMENT  grammar.  This  allows  sen¬ 
tences  to  be  parsed  as  well-formed  chunks  specified  by 
the  grammar,  in  particular  as  Ss,  NPs,  PPs,  and  VPs,  with 
unparsable  tokens  possibly  interspersed.  These  chunks 
have  both  c-structures  and  f-structures  corresponding  to 
them.  The  grammar  has  a  fewest-chunk  method  for  de¬ 
termining  the  correct  parse. 

The  grammar  incorporates  a  version  of  Optimality 
Theory  that  allows  certain  (sub)rules  in  the  grammar  to  be 
prefered  or  disprefered  based  on  OT  marks  triggered  by 
the  (sub)rule  (Frank  et  ah,  1998).  The  Complete  version 
of  the  grammar  uses  all  of  the  (sub)rules  in  a  multi-pass 
system  that  depends  on  the  ranking  of  the  OT  marks  in 
the  rules.  For  example,  topic alization  is  disprefered,  but 
the  topicalization  rule  will  be  triggered  if  no  other  parse 
can  be  built.  A  one-line  rewrite  of  the  Complete  grammar 
creates  a  Core  version  of  the  grammar  that  moves  the  ma¬ 
jority  of  the  OT  marks  into  the  NOGOOD  space.  This  ef¬ 
fectively  removes  the  (sub)rules  that  they  mark  from  the 
grammar.  So,  for  example,  in  the  Core  grammar  there  is 
no  topicalization  rule,  and  sentences  with  topics  will  re¬ 
ceive  a  FRAGMENT  parse.  This  single-pass  Core  grammar 
is  smaller  than  the  Complete  grammar  and  hence  is  faster. 

The  XLE  parser  also  allows  the  user  to  adjust  per¬ 
formance  parameters  bounding  the  amount  of  work  that 
is  done  in  parsing  for  efficient  but  accurate  operation. 
XLE’s  ambiguity  management  technology  takes  advan¬ 
tage  of  the  fact  that  relatively  few  f-structure  constraints 
apply  to  constituents  that  are  far  apart  in  the  c-structure, 
so  that  sentences  are  typically  parsed  in  polynomial  time 
even  though  LEG  parsing  is  known  to  be  an  NP-complete 
problem.  But  the  worst-case  exponential  behavior  does 
begin  to  appear  for  some  constructions  in  some  sentences, 
and  the  computational  effort  is  limited  by  a  SKIMMING 
mode  whose  onset  is  controlled  by  a  user-specified  pa¬ 
rameter.  When  skimming,  XLE  will  stop  processing  the 
subtree  of  a  constituent  whenever  the  amount  of  work  ex¬ 
ceeds  that  user-specified  limit.  The  subtree  is  discarded, 
and  the  parser  will  move  on  to  another  subtree.  This  guar¬ 
antees  that  parsing  will  be  finished  within  reasonable  lim¬ 
its  of  time  and  memory  but  at  a  cost  of  possibly  lower 
accuracy  if  it  causes  the  best  analysis  of  a  constituent 
to  be  discarded.  As  a  separate  parameter,  XLE  also  lets 
the  user  limit  the  length  of  medial  constituents,  i.e.,  con¬ 
stituents  that  do  not  appear  at  the  beginning  or  the  end 
of  a  sentence  (ignoring  punctuation).  The  rationale  be¬ 
hind  this  heuristic  is  to  limit  the  weight  of  constituents  in 
the  middle  of  the  sentence  but  still  to  allow  sentence-final 
heavy  constituents.  This  discards  constituents  in  a  some¬ 
what  more  principled  way  as  it  tries  to  capture  the  psy- 
cholinguistic  tendency  to  avoid  deep  center-embedding. 
When  limiting  the  length  of  medial  constituents,  cubic¬ 
time  parsing  is  possible  for  sentences  up  to  that  length, 
even  with  a  deep,  non-context-free  grammar,  and  linear 


parsing  time  is  possible  for  sentences  beyond  that  length. 

The  Complete  grammar  achieved  100%  coverage  of 
section  23  as  unseen  unlabeled  data:  79%  as  full  parses, 
21%  FRAGMENT  and/or  SKIMMED  parses. 

2.2  Dynamic  Programming  for  Estimation  and 
Stochastic  Disambiguation 

The  stochastic  disambiguation  model  we  employ  defines 
an  exponential  (a.k.a.  log-linear  or  maximum-entropy) 
probability  model  over  the  parses  of  the  LEG  grammar. 
The  advantage  of  this  family  of  probability  distributions 
is  that  it  allows  the  user  to  encode  arbitrary  properties 
of  the  parse  trees  as  feature-functions  of  the  probability 
model,  without  the  feature-functions  needing  to  be  inde¬ 
pendent  and  non-overlapping.  The  general  form  of  con¬ 
ditional  exponential  models  is  as  follows: 

Pxix\y)  = 

where  Zx{y)  =  is  a  normalizing  con¬ 

stant  over  the  set  X{y)  of  parses  for  sentence  y,  A  is 
a  vector  of  log-parameters,  /  is  a  vector  of  feature- 
values,  and  A  •  f{x)  is  a  vector  dot  product  denoting  the 
(log-)weight  of  parse  x. 

Dynamic-programming  algorithms  that  allow  the  ef¬ 
ficient  estimation  and  searching  of  log-linear  mod¬ 
els  from  a  packed  parse  representation  without  enu¬ 
merating  an  exponential  number  of  parses  have 
been  recently  presented  by  Miyao  and  Tsujii  (2002) 
and  Geman  and  Johnson  (2002).  These  algorithms  can 
be  readily  applied  to  the  packed  and/or-forests  of 
Maxwell  and  Kaplan  (1993),  provided  that  each  conjunc¬ 
tive  node  is  annotated  with  feature-values  of  the  log- 
linear  model.  In  the  notation  of  Miyao  and  Tsujii  (2002), 
such  feature  forest  <I)  is  defined  as  a  tuple  (C,  D,  r,  7, 6) 
where  C  is  a  set  of  conjunctive  nodes,  U  is  a  set  of  dis¬ 
junctive  nodes,  r  G  C  is  the  root  node,  7  :  ^  2*^  is 

a  conjunctive  daughter  function,  and  <)  :  C  — >  2^  is  a 
disjunctive  daughter  function. 

A  dynamic -programming  solution  to  the  problem  of 
finding  most  probable  parses  is  to  compute  the  weight 
fd  of  each  disjunctive  node  as  the  maximum  weight  of 
its  conjunctive  daugher  nodes,  i.e., 

fd  =  max  fc  (1) 

cG7(d) 

and  to  recursively  define  the  weight  fc  of  a  conjunctive 
node  as  the  product  of  the  weights  of  all  its  descendant 
disjunctive  nodes  and  of  its  own  weight: 

<(-c=  n  (2) 

d^5{c) 

Keeping  a  trace  of  the  maximally  weighted  choices  in  a 
computaton  of  the  weight  of  the  root  conjunctive  node 


r  allows  us  to  efficiently  recover  the  most  probable  parse 
of  a  sentence  from  the  packed  representation  of  its  parses. 

The  same  formulae  can  be  employed  for  an  effi¬ 
cient  calculation  of  probabilistic  expectations  of  feature- 
functions  for  the  statistical  estimation  of  the  parameters 
A.  Replacing  the  maximization  in  equation  1  by  a  sum¬ 
mation  defines  the  inside  weight  of  disjunctive  node.  Cor¬ 
respondingly,  equation  2  denotes  the  inside  weight  of  a 
conjunctive  node.  The  outside  weight  ipc  of  a  conjunctive 
node  is  defined  as  the  outside  weight  of  its  disjunctive 
mother  node(s): 

V’c  =  X!  (3) 

{d|ce7((i)} 

The  outside  weight  of  a  disjunctive  node  is  the  sum  of 
the  product  of  the  outside  weight(s)  of  its  conjunctive 
mother(s),  the  weight(s)  of  its  mother(s),  and  the  inside 
weight(s)  of  its  disjunctive  sister(s): 

V'd  =  (4) 

{c|(iG<5{c)}  {d'\d'GS{c),d'^d} 


From  these  formulae,  the  conditional  expectation  of  a 
feature-function  fi  can  be  computed  from  a  chart  with 
root  node  r  for  a  sentence  y  in  the  following  way: 


E 

xGX(y) 


Zx{y) 


E 

cec 


4*r 


(5) 


Formula  5  is  used  in  our  system  to  compute  expectations 
for  discriminative  Bayesian  estimation  from  partially  la¬ 
beled  data  using  a  first-order  conjugate-gradient  routine. 
For  a  more  detailed  description  of  the  optimization  prob¬ 
lem  and  the  feature-functions  we  use  for  stochastic  LFG 
parsing  see  Riezler  et  al.  (2002).  We  also  employed  a 
combined  £i  regularization  and  feature  selection  tech¬ 
nique  described  in  Riezler  and  Vasserman  (2004)  that 
considerably  speeds  up  estimation  and  guarantees  small 
feature  sets  for  stochastic  disambiguation.  In  the  experi¬ 
ments  reported  in  this  paper,  however,  dynamic  program¬ 
ming  is  crucial  for  efficient  stochastic  disambiguation, 
i.e.  to  efficiently  find  the  most  probable  parse  from  a 
packed  parse  forest  that  is  annotated  with  feature- values. 
There  are  two  operations  involved  in  stochastic  disam¬ 
biguation,  namely  calculating  feature-values  from  a  parse 
forest  and  calculating  node  weights  from  a  feature  forest. 
Clearly,  the  first  one  is  more  expensive,  especially  for 
the  extraction  of  values  for  non-local  feature-functions 
over  large  charts.  To  control  the  cost  of  this  compu¬ 
tation,  our  stochastic  disambiguation  system  includes 
a  user-specified  parameter  for  bounding  the  amount  of 
work  that  is  done  in  calculating  feature-values.  When  the 
user-specified  threshold  for  feature-value  calculation  is 
reached,  this  computation  is  discontinued,  and  the  dy¬ 
namic  programming  calculation  for  most-probable-parse 


search  is  computed  from  the  current  feature-value  anno¬ 
tation  of  the  parse  forest.  Since  feature-value  computa¬ 
tion  proceeds  incrementally  over  the  feature  forest,  i.e. 
for  each  node  that  is  visited  all  feature-functions  that  ap¬ 
ply  to  it  are  evaluated,  a  complete  feature  annotation  can 
be  guaranteed  for  the  part  of  the  and/or-forest  that  is  vis¬ 
ited  until  discontinuation.  As  discussed  below,  these  pa¬ 
rameters  were  set  on  a  held-out  portion  of  the  PARC700 
which  was  also  used  to  set  the  Collins  parameters. 

In  the  experiments  reported  in  this  paper,  we  used  a 
threshold  on  feature-extraction  that  allowed  us  to  cut  off 
feature-extraction  in  3%  of  the  cases  at  no  loss  in  accu¬ 
racy.  Overall,  feature  extraction  and  weight  calculation 
accounted  for  5%  of  the  computation  time  in  combined 
parsing  and  stochastic  selection. 

3  The  Gold- Standard  Dependency  Bank 

We  used  the  PARC  700  Dependency  Bank  (depbank) 
as  the  gold  standard  in  our  experiments.  The  DEPBANK 
consists  of  dependency  annotations  for  700  sentences  that 
were  randomly  extracted  from  section  23  of  the  UPenn 
Wall  Street  Journal  (WSJ)  treebank.  As  described  by 
(King  et  al.,  2003),  the  annotations  were  boot-strapped 
by  parsing  the  sentences  with  a  LFG  grammar  and  trans¬ 
forming  the  resulting  f-structures  to  a  collection  of  depen¬ 
dency  triples  in  the  DEPBANK  format.  To  prepare  a  true 
gold  standard  of  dependencies,  the  tentative  set  of  depen¬ 
dencies  produced  by  the  robust  parser  was  then  corrected 
and  extended  by  human  validators^.  In  this  format  each 
triple  specifies  that  a  particular  relation  holds  between  a 
head  and  either  another  head  or  a  feature  value,  for  ex¬ 
ample,  that  the  SUBJ  relation  holds  between  the  heads 
run  and  dog  in  the  sentence  The  dog  ran.  Average  sen¬ 
tence  length  of  sentences  in  DEPBANK  is  19.8  words,  and 
the  average  number  of  dependencies  per  sentence  is  65.4. 
The  corpus  is  freely  available  for  research  and  evaluation, 
as  are  documentation  and  tools  for  displaying  and  prun¬ 
ing  structures.^ 

In  our  experiments  we  used  a  Reduced  version  of  the 
DEPBANK,  including  just  the  minimum  set  of  dependen¬ 
cies  necessary  for  reading  out  the  central  semantic  rela¬ 
tions  and  properties  of  a  sentence.  We  tested  against  this 
Reduced  gold  standard  to  establish  accuracy  on  a  lower 
bound  of  the  information  that  a  meaning-sensitive  appli¬ 
cation  would  require.  The  Reduced  version  contained  all 
the  argument  and  adjunct  dependencies  shown  in  Fig. 
1,  and  a  few  selected  semantically-relevant  features,  as 
shown  in  Fig.  2.  The  features  in  Fig.  2  were  chosen  be- 

^The  resulting  test  set  is  thus  unseen  to  the  grammar  and 
stochastic  disambiguation  system  used  in  our  experiments.  This 
is  indicated  by  the  fact  that  the  upperbound  of  F-score  for  the 
best  matching  parses  for  the  experiment  grammar  is  in  the  range 
of  85%,  not  100%. 

^http://www2.parc.com/istl/groups/nltt/fsbank/ 


Function 

Meaning 

adjunct 

adjuncts 

aquant 

adjectival  quantifiers  (many,  etc.) 

comp 

complement  clauses  (that,  whether) 

conj 

conjuncts  in  coordinate  structures 

focusTnt 

fronted  element  in  interrogatives 

mod 

noun-noun  modifiers 

number 

numbers  modifying  nouns 

obj 

objects 

objJheta 

secondary  objects 

obi 

oblique 

obLag 

demoted  subject  of  a  passive 

obLcompar 

comparative  than/as  clauses 

poss 

possessives  {John’s  book) 

pronjnt 

interrogative  pronouns 

pronjel 

relative  pronouns 

quant 

quantifiers  (all,  etc.) 

subj 

subjects 

topic  _rel 

fronted  element  in  relative  clauses 

xcomp 

non-finite  complements 
verbal  and  small  clauses 

Figure  1:  Grammatical  functions  in  DEPBANK. 

cause  it  was  felt  that  they  were  fundamental  to  the  mean¬ 
ing  of  the  sentences,  and  in  fact  they  are  required  by  the 
semantic  interpreter  we  have  used  in  a  knowledge-based 
application  (Crouch  et  ah,  2002). 

Feature  Meaning 

adegree  degree  of  adjectives  and  adverbs 

(positive,  comparative,  superlative) 
coord_form  form  of  a  coordinating 

conjunction  (e.g.,  and,  or) 
det_form  form  of  a  determiner  (e.g.,  the,  a) 

num  number  of  nouns  (sg,  pi) 

number_type  cardinals  vs.  ordinals 

passive  passive  verb  (e.g..  It  was  eaten.) 

perf  perfective  verb  (e.g.,  have  eaten) 

precoord_form  either,  neither 

prog  progressive  verb  (e.g.,  were  eating) 

pron_form  form  of  a  pronoun  (he,  she,  etc.) 

prt_form  particle  in  a  particle  verb 

(e.g..  They  threw  it  out.) 
stmtJype  statement  type  (declarative, 

interrogative,  etc.) 

subord_form  subordinating  conjunction  (e.g.  that) 

tense  tense  of  the  verb  (past,  present,  etc.) 

Figure  2:  Selected  features  for  Reduced  DEPBANK 

As  a  concrete  example,  the  dependency  list  in  Fig.  3  is 
the  Reduced  set  corresponding  to  the  following  sentence: 

He  reiterated  his  opposition  to  such  funding, 
but  expressed  hope  of  a  compromise. 

An  additional  feature  of  the  DEPBANK  that  is  relevant 
to  our  comparisons  is  that  dependency  heads  are  rep¬ 
resented  by  their  standard  citation  forms  (e.g.  the  verb 
swam  in  a  sentence  appears  as  swim  in  its  dependencies). 


We  believe  that  most  applications  will  require  a  conver¬ 
sion  to  canonical  citation  forms  so  that  semantic  relations 
can  be  mapped  into  application-specific  databases  or  on¬ 
tologies.  The  predicates  of  LFG  f-structures  are  already 
represented  as  citation  forms;  for  a  fair  comparison  we 
ran  the  leaves  of  the  Collins  tree  through  the  same  stem- 
mer  modules  as  part  of  the  tree-to-dependency  transla¬ 
tion.  We  also  note  that  proper  names  appear  in  the  DEP¬ 
BANK  as  single  multi-word  expressions  without  any  in¬ 
ternal  structure.  That  is,  there  are  no  dependencies  hold¬ 
ing  among  the  parts  of  people  names  (A.  Boyd  Simpson), 
company  names  {Goldman,  Sachs  &  Co),  and  organiza¬ 
tion  names  {Federal  Reserve).  This  multiword  analysis 
was  chosen  because  many  applications  do  not  require 
the  internal  structure  of  names,  and  the  identification  of 
named  entities  is  now  typically  carried  out  by  a  separate 
non-syntactic  pre-processing  module.  This  was  captured 
for  the  LFG  parser  by  using  named  entity  markup  and  for 
the  Collins  parser  by  creating  complex  word  forms  with 
a  single  POS  tag  (section  5). 

conj(coord~0,  express~3) 
conj(coord~0,  reiterate~l) 
coord_form(coord~0,  but) 
stmt_type(coord~0,  declarative) 
obj(reiterate~l,  opposition~6) 
subj(reiterate~l,  pro~7) 
tense(reiterate~l,  past) 
obj(express~3,  hope~15) 
subj(express~3,  pro~7) 
tense(express~3,  past) 
adjunct(opposition~6,  to~ll) 
num(opposition~6,  sg) 
poss(opposition~6,  pro~19) 
num(pro~7,  sg) 
pron_form(pro~7,  he) 
obj(to~ll,  funding~13) 
adjunct(funding~13,  such~45) 
num(funding~13,  sg) 
adjunct(hope~15,  of~46) 
num(hope~15,  sg) 
num(pro~19,  sg) 
pron_form(pro~19,  he) 
adegree(such~45,  positive) 
obj(of~46,  compromise~54) 
det_form(compromise~54,  a) 
num(compromise~54,  sg) 

Figure  3:  Reduced  dependency  relations  for  He  reiterated 
his  opposition  to  such  funding,  but  expressed  hope  of  a 
compromise. 

4  Conversion  to  Dependency  Bank  Format 

A  conversion  routine  was  required  for  each  system  to 
transform  its  output  so  that  it  could  be  compared  to  the 
DEPBANK  dependencies.  While  it  is  relatively  straightfor¬ 
ward  to  convert  LFG  f-structures  to  the  dependency  bank 


format  because  the  f-stmcture  is  effectively  a  dependency 
format,  it  is  more  difficult  to  transform  the  output  trees  of 
the  Model  3  Collins  parser  in  a  way  that  fairly  allocates 
both  credits  and  penalties. 

LFG  Conversion  We  discarded  the  LFG  tree  structures 
and  used  a  general  rewriting  system  previously  developed 
for  machine  translation  to  rewrite  the  relevant  f-structure 
attributes  as  dependencies  (see  King  et  al.  (2003)).  The 
rewritings  involved  some  deletions  of  irrelevant  features, 
some  systematic  manipulations  of  the  analyses,  and  some 
trivial  respellings.  The  deletions  involved  features  pro¬ 
duced  by  the  grammar  but  not  included  in  the  PARC  700 
such  as  negative  values  of  PASS,  PERF,  and  PROG  and 
the  feature  MEASURE  used  to  mark  measure  phrases.  The 
manipulations  are  more  interesting  and  are  necessary  to 
map  systematic  differences  between  the  analyses  in  the 
grammar  and  those  in  the  dependency  bank.  For  example, 
coordination  is  treated  as  a  set  by  the  LFG  grammar  but  as 
a  single  COORD  dependency  with  several  CONJ  relations 
in  the  dependency  bank.  Finally,  the  trivial  rewritings 
were  used  to,  for  example,  change  STMT-TYPE  decl  in 
the  grammar  to  STMT-TYPE  declarative  in  the  de¬ 
pendency  bank.  For  the  Reduced  version  of  the  PARC 
700  substantially  more  features  were  deleted. 

Collins  Model  3  Conversion  An  abbreviated  represen¬ 
tation  of  the  Collins  tree  for  the  example  above  is  shown 
in  Fig.  4.  In  this  display  we  have  eliminated  the  head  lex¬ 
ical  items  that  appear  redundantly  at  all  the  nonterminals 
in  a  head  chain,  instead  indicating  by  a  single  number 
which  daughter  is  the  head.  Thus,  S~2  indicates  that  the 
head  of  the  main  clause  is  its  second  daughter,  the  VP, 
and  its  head  is  its  first  VP  daughter.  Indirectly,  then,  the 
lexical  head  of  the  S  is  the  first  verb  reiterated. 

(TOP~l 

(S~2  (NP-A~1  (NPB~1  He/PRP)) 

(VP~1  (VP~1  reiteratedWBD 

(NP-A~1  (NPB~2  his/PRP$ 
opposition/NN) 

(PP~1  to/TO 

(NPB~2  such/JJ 
funding/NN)))) 

but/CC 

(VP~1  expressed/VBD 

(NP-A~1  (NPB~1  hope/NN) 

(PP~1  of/IN 

(NP-A~1  (NPB~2  a/DT 
compromise/NN)))))))) 

Figure  4:  Collins  Model  3  tree  for  He  reiterated  his  op¬ 
position  to  such  funding,  but  expressed  hope  of  a  compro¬ 
mise. 

The  Model  3  output  in  this  example  includes  standard 
phrase  structure  categories,  indications  of  the  heads,  and 


the  additional  -A  marker  to  distinguish  arguments  from 
adjuncts.  The  terminal  nodes  of  this  tree  are  inflected 
forms,  and  the  first  phase  of  our  conversion  replaces  them 
with  their  citation  forms  (the  verbs  reiterate  and  express, 
and  the  decapitalized  and  standardized  he  for  He  and  his). 
We  also  adjust  for  systematic  differences  in  the  choice  of 
heads.  The  first  conjunct  tends  to  be  marked  as  the  head 
of  a  coordination  in  Model  3  output,  whereas  the  depen¬ 
dency  bank  has  a  more  symmetric  representation;  it  in¬ 
troduces  a  new  COORD  head  and  connects  that  up  to  the 
conjunction,  and  it  uses  a  separate  CONJ  relation  for  each 
of  the  coordinated  items.  Similarly,  Model  3  identifies 
the  syntactic  markers  to  and  that  as  the  heads  of  com¬ 
plements,  whereas  the  dependency  bank  treats  these  as 
selectional  features  and  marks  the  main  predicate  of  the 
complements  as  the  head.  These  adjustments  are  carried 
out  without  penalty.  We  also  compensate  for  the  differ¬ 
ences  in  the  representation  of  auxiliaries:  Model  3  treats 
these  as  main  verbs  with  embedded  complements  instead 
of  the  PERE,  PROG,  and  PASSIVE  features  of  the  DEP- 
BANK,  and  our  conversion  flattens  the  trees  so  that  the 
features  can  be  read  off. 

The  dependencies  are  read  off  after  these  and  a  few 
other  adjustments  are  made.  NPs  under  VPs  are  read  off 
either  as  objects  or  adjuncts,  depending  on  whether  or 
not  the  NP  is  annotated  with  the  argument  indicator  (-A) 
as  in  this  example;  the  -A  presumably  would  be  miss¬ 
ing  in  a  sentence  like  John  arrived  Friday,  and  Friday 
would  be  treated  as  an  ADJUNCT.  Similarly,  NP-As  un¬ 
der  S  are  read  off  as  subject.  In  this  example,  however, 
this  principle  of  conversion  does  not  lead  to  a  match  with 
the  dependency  bank:  in  the  DEPBANK  grammatical  rela¬ 
tions  that  are  factored  out  of  conjoined  structures  are  dis¬ 
tributed  back  into  those  structures,  to  establish  the  correct 
semantic  dependencies  (in  this  case,  that  he  is  the  subject 
of  both  reiterate  and  express  and  not  of  the  introduced 
coord).  We  avoided  the  temptation  of  building  coordinate 
distribution  into  the  conversion  routine  because,  first,  it  is 
not  always  obvious  from  the  Model  3  output  when  dis¬ 
tribution  should  take  place,  and  second,  that  would  be 
a  first  step  towards  building  into  the  conversion  routine 
the  deep  lexical  and  syntactic  knowledge  (essentially  the 
functional  component  of  our  LFG  grammar)  that  the  shal¬ 
low  approach  explicitly  discounts'*. 

For  the  same  reasons  our  conversion  routine  does  not 
identify  the  subjects  of  infinitival  complements  with  par¬ 
ticular  arguments  of  matrix  verbs.  The  Model  3  trees  pro¬ 
vide  no  indication  of  how  this  is  to  be  done,  and  in  many 
cases  the  proper  assignment  depends  on  lexical  informa¬ 
tion  about  specific  predicates  (to  capture,  for  example,  the 
well-known  contrast  between  promise  and  persuade). 

Model  3  trees  also  provide  information  about  certain 

*  However,  we  did  explore  a  few  of  these  additional  transfor¬ 
mations  and  found  only  marginal  F-score  increases. 


long-distance  dependencies,  by  marking  with  -g  annota¬ 
tions  the  path  between  a  filler  and  a  gap  and  marking  the 
gap  by  an  explicit  TRACE  in  the  terminal  string.  The  filler 
itself  is  not  clearly  identified,  but  our  conversion  treats 
all  WH  categories  under  SBAR  as  potential  fillers  and 
attempts  to  propagate  them  down  the  gap-chain  to  link 
them  up  to  appropriate  traces. 

In  sum,  it  is  not  a  trivial  matter  to  convert  a  Model  3 
tree  to  an  appropriate  set  of  dependency  relations,  and  the 
process  requires  a  certain  amount  of  intuition  and  skill. 
For  our  experiments  we  tried  to  define  a  conversion  that 
gives  appropriate  credit  to  the  dependencies  that  can  be 
read  from  the  trees  without  relying  on  an  undue  amount 
of  sophisticated  linguistic  knowledge^. 

5  Experiments 

We  conducted  our  experiments  by  preparing  versions  of 
the  test  sentences  in  the  form  appropriate  to  each  sys¬ 
tem.  We  used  a  configuration  of  the  XLE  parser  that  ex¬ 
pects  sentences  conforming  to  ordinary  text  conventions 
to  appear  in  a  file  separated  by  double  line-feeds.  A  cer¬ 
tain  amount  of  effort  was  required  to  remove  the  part-of- 
speech  tags  and  labeled  brackets  of  the  WSJ  corpus  in  a 
way  that  restored  the  sentences  to  a  standard  English  for¬ 
mat  (for  example,  to  remove  the  space  between  wo  and  n  ’t 
that  remains  when  the  POS  tags  are  removed).  Since  the 
PARC  700  treats  proper  names  as  multiword  expressions, 
we  then  augmented  the  input  strings  with  XML  markup 
of  the  named  entities.  These  are  parsed  by  the  grammar 
as  described  in  section  2.  We  used  manual  named  entity 
markup  for  this  experiment  because  our  intent  is  to  mea¬ 
sure  parsing  technology  independent  of  either  the  time 
or  errors  of  an  automatic  named-entity  extractor.  How¬ 
ever,  in  other  experiments  with  an  automatic  finite-state 
extractor,  we  have  found  that  the  time  for  named-entity 
recognition  is  negligible  (on  the  order  of  seconds  across 
the  entire  corpus)  and  makes  relatively  few  errors,  so  that 
the  results  reported  here  are  good  approximations  of  what 
might  be  expected  in  more  realistic  situations. 

As  input  to  the  Collins  parser,  we  used  the  part-of- 
speech  tagged  version  of  section  23  that  was  provided 
with  the  parser.  Erom  this  we  extracted  the  700  sentences 
in  the  PARC  700.  We  then  modified  them  to  produce 
named  entity  input  so  that  the  parses  would  match  the 
PARC  700.  This  was  done  by  putting  underscores  be¬ 
tween  the  parts  of  the  named  entity  and  changing  the  final 
part  of  speech  tag  to  the  appropriate  one  (usually  NNP) 
if  necessary.  (The  number  of  words  indicated  at  the  be¬ 
ginning  of  the  input  string  was  also  reduced  accordingly.) 
An  example  is  shown  in  (1). 


^The  results  of  this  conversion  are  available  at 
http://www2.parc.com/istl/groups/nltt/fsbank/ 


(1)  Sen.  NNP  Christopher  NNP  Dodd  NNP  — > 
Sen._Christopher_Dodd  NNP 

After  parsing,  the  underscores  were  converted  to  spaces 
to  match  the  PARC  700  predicates. 

Before  the  final  evaluation,  1/5  of  the  PARC  700  de¬ 
pendency  bank  was  randomly  extracted  as  a  heldout  set. 
This  set  was  used  to  adjust  the  performance  parameters  of 
the  XLE  system  and  the  Collins  parser  so  as  to  optimize 
parsing  speed  without  losing  accuracy.  Eor  example,  the 
limit  on  the  length  of  medial  phrases  was  set  to  20  words 
for  the  XLE  system  (see  Sec.  2),  and  a  regularizer  penalty 
of  10  was  found  optimal  for  the  prior  used  in  stochas¬ 
tic  disambiguation.  Eor  the  Collins  parser,  a  beam  size 
of  1000  was  found  to  improve  speed  considerably  at  lit¬ 
tle  cost  in  accuracy.  Eurthermore,  the  np-bracketing  flag 
(npbflag)  was  set  to  0  to  produce  an  extended  set  of  NP 
levels  for  improved  argument/adjunct  distinction®.  The  fi¬ 
nal  evaluation  was  done  on  the  remaining  560  examples. 
Timing  results  are  reported  in  seconds  of  CPU  time^.  POS 
tagging  of  the  input  to  the  Collins  parser  took  6  seconds 
and  this  was  added  to  the  timing  result  of  the  Collins 
parser.  Time  spent  for  finite-state  morphology  and  dictio¬ 
nary  lookup  for  XLE  is  part  of  the  measure  of  its  timing 
performance.  We  did  not  include  the  time  for  dependency 
extraction  or  stemming  the  Collins  output. 

Table  1  shows  timing  and  accuracy  results  for  the  Re¬ 
duced  dependency  set.  The  parser  settings  compared  are 
Model  3  of  the  Collins  parser  adjusted  to  beam  size  1000, 
and  the  Core  and  Complete  versions  of  the  XLE  sys¬ 
tem,  differing  in  the  size  of  the  grammar’s  constraint- 
set.  Clearly,  both  versions  of  the  XLE  system  achieve  a 
significant  reduction  in  error  rate  over  the  Collins  parser 
(12%  for  the  core  XLE  system  and  20%  for  the  complete 
system)  at  an  increase  in  parsing  time  of  a  factor  of  only 
1 .49  for  the  core  XLE  system.  The  complete  version  gives 
an  overall  improvement  in  E-score  of  5%  over  the  Collins 
parser  at  a  cost  of  a  factor  of  5  in  parsing  time. 


Table  1:  Timing  and  accuracy  results  for  Collins  parser 
and  Complete  and  Core  versions  of  XLE  system  on  Re¬ 
duced  version  of  PARC  700  dependency  bank. 


time 

prec. 

rec. 

E-score 

LEG  core 

298.88 

79.1 

76.2 

77.6 

LEG  complete 

985.3 

79.4 

79.8 

79.6 

Collins  1000 

199.6 

78.3 

71.2 

74.6 

®A  beam  size  of  10000  as  used  in  Collins  (1999)  improved 
the  F-score  on  the  heldout  set  only  by  .1%  at  an  increase  of  pars¬ 
ing  time  by  a  factor  of  3.  Beam  sizes  lower  than  1000  decreased 
the  heldout  F-score  significantly. 

’All  experiments  were  run  on  one  CPU  of  a  dual  proces¬ 
sor  AMD  Opteron  244  with  1.8  GHz  and  4GB  main  memory. 
Loading  times  are  included  in  CPU  times. 


6  Conclusion 

We  presented  some  experiments  that  compare  the  accu¬ 
racy  and  performance  of  two  stochastic  parsing  systems, 
the  shallow  Collins  parser  and  the  deep-grammar-based 
XLE  system.  We  measured  the  accuracy  of  both  systems 
against  a  gold  standard  derived  from  the  PARC  700  de¬ 
pendency  bank,  and  also  measured  their  processing  times. 
Contrary  to  conventional  wisdom,  we  found  that  the  shal¬ 
low  system  was  not  substantially  faster  than  the  deep 
parser  operating  on  a  core  grammar,  while  the  deep  sys¬ 
tem  was  significantly  more  accurate.  Furthermore,  ex¬ 
tending  the  grammar  base  of  the  deep  system  results  in 
much  better  accuracy  at  a  cost  of  a  factor  of  5  in  speed. 

Our  experiment  is  comparable  to  recent  work  on  read¬ 
ing  off  Propbank-style  (Kingsbury  and  Palmer,  2002) 
predicate-argument  relations  from  gold-standard  tree- 
bank  trees  and  automatic  parses  of  the  Collins  parser. 
Gildea  and  Palmer  (2002)  report  F-score  results  in  the 
55%  range  for  argument  and  boundary  recognition  based 
on  automatic  parses.  From  this  perspective,  the  nearly 
75%  F-score  that  is  achieved  for  our  deterministic  rewrit¬ 
ing  of  Collins’  trees  into  dependencies  is  remarkable, 
even  if  the  results  are  not  directly  comparable.  Our  scores 
and  Gildea  and  Palmer’s  are  both  substantially  lower  than 
the  90%  typically  cited  for  evaluations  based  on  labeled 
or  unlabeled  bracketing,  suggesting  that  extracting  se¬ 
mantically  relevant  dependencies  is  a  more  difficult,  but 
we  think  more  valuable,  task. 
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